Large Scale Data Mining: Challenges and Responses
Authors
Abstract
Data mining over large data-sets is important due to its obvious commercial potential. However, it is also a major challenge due to its computational complexity. Exploiting the inherent parallelism of data mining algorithms provides a direct solution by utilising the large data retrieval and processing power of parallel architectures. In this paper, we present some results of our intensive research on parallelising data mining algorithms. In particular, we present a methodology for determining the proper parallelisation strategy based on the idea of algorithmic skeletons and performance modelling. This research aims to provide a systematic way to develop parallel data mining algorithms and applications.
Parallelism in Data Mining Algorithms
Numerous algorithms have been previously developed for data mining (Fayyad, Piatetsky-Shapiro, & Smyth ...). The degree of parallelism in these algorithms allows some flexibility in choosing the parallelisation scheme that is most suited to a particular parallel machine. Two major methods for exploiting parallelism within data mining algorithms can be identified: task parallelism and data parallelism. A brief summary of how the two approaches can be used to parallelise different data mining algorithms is given in Table 1. In the task parallelism approach, the computation is partitioned amongst the processors of a parallel machine, with each processor computing a distinct part of a learning model before co-ordinating with the other processors to form the global model. In the data parallelism approach, the training set is partitioned amongst the processors, with all processors synchronously constructing the same model, each operating on a different portion of the data-set.
Our practical experience with parallelising data mining algorithms showed an interesting phenomenon.
While the parallelisation of certain data mining algorithms shows a consistent performance behaviour when...
Copyright © 1997, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
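The difference between the task-parallel and data-parallel approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy data-set, the "model" (a table of attribute-value/class counts) and the names `ROWS`, `count_rows`, `data_parallel` and `task_parallel` are all assumptions made for illustration, and threads merely stand in for the processors of a parallel machine, so the sketch shows how the work is partitioned rather than any real speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy training set: (attribute 0, attribute 1, class label).
ROWS = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rainy", "mild", "yes"),
    ("rainy", "cool", "yes"),
    ("overcast", "hot", "yes"),
    ("overcast", "cool", "yes"),
]
N_ATTRIBUTES = 2


def count_rows(rows, attributes):
    """Count (attribute index, value, label) triples for the given rows and attributes."""
    counts = Counter()
    for row in rows:
        label = row[-1]
        for a in attributes:
            counts[(a, row[a], label)] += 1
    return counts


def data_parallel(rows, n_workers):
    """Data parallelism: each worker builds the whole count table over its own slice of the rows."""
    chunks = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda chunk: count_rows(chunk, range(N_ATTRIBUTES)), chunks)
        # Merge the partial tables into the global model.
        return sum(partials, Counter())


def task_parallel(rows):
    """Task parallelism: each worker builds one part of the model (one attribute) over all rows."""
    with ThreadPoolExecutor(max_workers=N_ATTRIBUTES) as pool:
        parts = pool.map(lambda a: count_rows(rows, [a]), range(N_ATTRIBUTES))
        # Combine the distinct parts into the global model.
        return sum(parts, Counter())
```

Both functions return the same count table as a sequential call to `count_rows(ROWS, range(N_ATTRIBUTES))`. What differs is what each worker receives: a slice of the training set in the data-parallel case, and a slice of the model in the task-parallel case.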
Similar resources
Parallel and Distributed Data Mining: An Introduction
The explosive growth in data collection in business and scientific fields has literally forced upon us the need to analyze and mine useful knowledge from it. Data mining refers to the entire process of extracting useful and novel patterns/models from large datasets. Due to the huge size of data and amount of computation involved in data mining, high-performance computing is an essential compone...
Chapter 9 MINING TEXT STREAMS
The large amount of text data which are continuously produced over time in a variety of large scale applications such as social networks results in massive streams of data. Typically massive text streams are created by very large scale interactions of individuals, or by structured creations of particular kinds of content by dedicated organizations. An example in the latter category would be the...
Automatic Discovery of Technology Networks for Industrial-Scale R&D IT Projects via Data Mining
Industrial-scale R&D IT projects depend for their success on many sub-technologies, which need to be understood and have their risks analysed before the project can begin. When planning such an industrial-scale project, the list of technologies and the associations of these technologies with each other is often complex and forms a network. Discovery of this network of technologies is time consumi...
Towards Web Search Engine Scale Data Mining
Data mining is one of the most critical driving technologies behind Web search engines. Web search engine scale data mining poses many grand challenges, ranging from efficiency and scalability to diversity and adaptability. In this talk, I will review our recent effort on mining a very large amount of data accumulated in one of the major commercial search engines. Particularly, we tackle the pr...
Challenges in Parallel Graph Processing
Graph algorithms are becoming increasingly important for solving many problems in scientific computing, data mining and other domains. As these problems grow in scale, parallel computing resources are required to meet their computational and memory requirements. Unfortunately, the algorithms, software, and hardware that have worked well for developing mainstream parallel scientific applications...
Stream Data Mining: Platforms, Algorithms, Performance Evaluators and Research Trends
Streaming data are a potentially infinite sequence of incoming data arriving at very high speed, and may evolve over time. This causes several challenges in mining large-scale, high-speed data streams in real time. Hence, this field has gained a lot of attention from researchers in previous years. This paper discusses various challenges associated with mining such data streams. Several available stream da...
Publication date: 1997